GAMMs are not time series models (!)

This title is a bit provocative, yet strictly speaking it is correct. We will work through two examples that reveal what GAMMs are missing before they can serve as tools for time series analysis. In short, we need to communicate to the model something about the order of the time samples, which we have not done so far.

Consider the data below:
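For readers who want to follow along without the original curves data frame, a toy data set with a similar structure (many noisy curves sharing a common smooth shape, with columns time, y and a curve identifier curveId) could be simulated along these lines. The shape, sizes and noise levels here are made-up assumptions, not the original data:

```r
set.seed(1)

n_curves  <- 50    # assumed number of curves (not from the original data)
n_samples <- 100   # assumed samples per curve, on a common time grid

time <- seq(0, 1, length.out = n_samples)
mu   <- dnorm(time, mean = 0.4, sd = 0.1)  # a made-up common peak shape

curves <- do.call(rbind, lapply(seq_len(n_curves), function(i) {
  # each curve is the common shape, randomly rescaled,
  # plus autocorrelated noise along the time axis
  amp <- 1 + rnorm(1, sd = 0.2)
  data.frame(curveId = factor(i),
             time    = time,
             y       = amp * mu +
                       as.numeric(arima.sim(list(ar = 0.8), n_samples,
                                            sd = 0.02)))
}))

str(curves)  # one row per (curve, time) sample
```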

Let’s estimate a GAM:

m1 <- bam(y ~ s(time, k = 20), data = curves)
summary(m1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time, k = 20)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.1795848  0.0004607   389.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df    F p-value    
## s(time) 15.52   17.6 6060  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.915   Deviance explained = 91.5%
## fREML = -16489  Scale est. = 0.0021116  n = 9950
plot_smooth(m1, view = "time", col = "slateblue4", print.summary = FALSE, rug = FALSE)

Now imagine that the same data set were not a collection of time series. For example, instead of time we could have house prices (in millions of euros), and y could be some measure of demand. The plot would look like this:

Question: How would you change the GAM above to reflect the fact that there are no curves but it is all one data set?

AR1 residuals

gam.check(m1)

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 9 iterations.
## Gradient range [-6.31743e-06,6.479967e-06]
## (score -16488.76 & scale 0.002111591).
## Hessian positive definite, eigenvalue range [8.040793,4974.011].
## Model rank =  20 / 20 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k'  edf k-index p-value
## s(time) 19.0 15.5    0.99    0.23

We note strong evidence of heteroscedasticity. Where does it come from?

Look at the variation in the large peak. When y is high we are around the peak, and that is also where the largest variation across curves occurs, hence the wider residual span for higher fitted values.

Now let’s consider one particular curve, say one whose first peak is higher than the mean. As we see from the raw curves, we expect that if the residual error for this particular curve at say time = 0.4 is positive, it will also be positive in the next time samples. Hence neighbouring residuals are correlated, which violates one of the assumptions of the model. We would like to express this fact in the GAMM.

A way to alleviate the problem above is to impose a structure on the residuals as follows: \[ \epsilon_i = \rho \cdot \epsilon_{i-1} + \psi_i \] which means that the residual at point \(i\) is proportional to the residual at point \(i-1\), i.e. one step earlier on the time axis, plus another unknown term \(\psi_i\), where the \(\psi_i\) are independent and distributed as \(N(0, \sigma^2)\). The GAMM will not estimate the parameter \(\rho\) for us directly; rather, we need to set it ourselves. This type of sub-model for the noise is well known in signal processing, where it is called AR1, i.e. an autoregressive model of order 1 (because it only depends on one step into the past).
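To see what such residuals look like, here is a small self-contained simulation of the AR1 recursion above; the values of \(\rho\), \(\sigma\) and the sample size are arbitrary choices for illustration:

```r
set.seed(42)

rho   <- 0.8     # illustrative AR1 coefficient
sigma <- 1       # sd of the innovations psi
n     <- 10000

eps <- numeric(n)
psi <- rnorm(n, sd = sigma)
for (i in 2:n) {
  # each residual is rho times the previous one, plus fresh noise
  eps[i] <- rho * eps[i - 1] + psi[i]
}

# the empirical autocorrelation at lag 1 should be close to rho
acf(eps, lag.max = 5, plot = FALSE)$acf[2]
```

The printed lag-1 autocorrelation is close to the \(\rho\) we simulated with, which is exactly the quantity we will read off the residual ACF below.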

What we typically do is this:

  1. Estimate a GAMM without AR1 term (like above)
  2. Compute autocorrelation of residuals
  3. Re-run the GAMM setting \(\rho\) equal to the autocorrelation value at lag 1

The following estimates \(\rho\) and plots the complete autocorrelation function for the residuals:

m1.acf <- acf_resid(m1)

This function estimates the correlation of the residuals with their own past values at different lags. We take the value at lag 1 as our estimate for \(\rho\):

rho <- m1.acf[2]
# at index 1 we have lag 0, whose value is 1 by definition
# at index 2 we have lag 1, which is what we need
rho
##         1 
## 0.8045908
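If you prefer not to index into the acf object yourself, the itsadug package (which also provides acf_resid) has a convenience function that extracts this lag-1 value directly from a fitted model:

```r
library(itsadug)

# equivalent to m1.acf[2] above; with plot = TRUE it also
# draws the residual ACF with the chosen value marked
rho <- start_value_rho(m1, plot = TRUE)
```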

Now we re-run the GAMM. We set \(\rho\) by specifying the argument rho. We also need to indicate where each curve starts in the data (argument AR.start); otherwise the AR1 model would also be applied across the boundary between the last sample of one curve and the first sample of the next.

m1.ar1 <- bam(y ~ s(time, k = 20), data = curves,
              rho = rho, AR.start = curves$time == 0
              )
summary(m1.ar1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time, k = 20)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.179576   0.001376   130.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(time) 12.43  15.13 804.6  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.915   Deviance explained = 91.5%
## fREML = -21706  Scale est. = 0.0020926  n = 9950
plot_smooth(m1.ar1, view = "time", col = "slateblue4", rug = FALSE, print.summary = FALSE)
acf_resid(m1.ar1)
gam.check(m1.ar1)
## 
## Method: fREML   Optimizer: perf newton
## full convergence after 7 iterations.
## Gradient range [-1.51787e-05,1.483733e-05]
## (score -21705.86 & scale 0.002092649).
## Hessian positive definite, eigenvalue range [6.899551,4974.007].
## Model rank =  20 / 20 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k'  edf k-index p-value
## s(time) 19.0 12.4    1.04       1

The results show that:

  • The strong positive lag-1 autocorrelation is gone
  • but an AR pattern is still there
  • heteroscedasticity is still there too
  • the estimated smooth has a wider confidence band
  • the AR1-augmented model converged quite quickly

Curve-specific random factors

The way to explicitly represent the information about curves, i.e. which sample belongs to which curve, is to introduce a random smooth factor at the curve level:

m1.randCurves <- bam(y ~ s(time, k = 20) +
                       s(time, curveId, bs = "fs", m = 1, k = 15),
                     nthreads = 2,
                     data = curves)
summary(m1.randCurves)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(time, k = 20) + s(time, curveId, bs = "fs", m = 1, k = 15)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.179574   0.004535    39.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df      F p-value    
## s(time)          17.69   18.6 671.54  <2e-16 ***
## s(time,curveId) 505.63  749.0  56.77  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.984   Deviance explained = 98.5%
## fREML = -24077  Scale est. = 0.00039985  n = 9950
plot_smooth(m1.randCurves, view = "time", col = "slateblue4", rug = FALSE, print.summary = FALSE)
op <- par(mfrow=c(2,2))
gam.check(m1.randCurves)
## 
## Method: fREML   Optimizer: perf newton
## full convergence after 7 iterations.
## Gradient range [-1.198123e-07,1.173967e-07]
## (score -24076.94 & scale 0.0003998499).
## Hessian positive definite, eigenvalue range [8.57444,4985.08].
## Model rank =  770 / 770 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                    k'   edf k-index p-value
## s(time)          19.0  17.7       1    0.43
## s(time,curveId) 750.0 505.6       1    0.42
acf_resid(m1.randCurves)
par(op)

The results show that:

  • The AR pattern in the residuals is gone
  • residuals look uncorrelated, with no visible pattern
  • the estimated smooth has a wider confidence band
  • the model augmented with curve-level random smooths takes longer to converge

As an alternative to the mgcv library, you can try the sparseFLMM library, which incorporates curve-level random smooths by default and performs efficient estimation based on functional PCA. sparseFLMM is (in my opinion) generally less user-friendly than mgcv.

AR1 residuals and latent factors

This is a curve dataset with a 4-level factor F4. An AR1 residual with coefficient \(\rho =\) 0.9 is added to the 4 expected curves.

Let us first model it as: y ~ F4 + s(t, by = F4), i.e. a regular GAM with one factor and no AR1 residual.
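The fitting code is not shown here; assuming the data frame is called dat4 (a made-up name) and carries columns y, t and the factor F4, the output below would come from a call along these lines:

```r
library(mgcv)

# factor-by smooths: one smooth of t per level of F4,
# plus the parametric F4 term for the level-specific intercepts
m.f4 <- bam(y ~ F4 + s(t, by = F4, k = 10), data = dat4)
summary(m.f4)
```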

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F4 + s(t, by = F4, k = 10)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.503876   0.001413 356.636  < 2e-16 ***
## F4A.2        0.482954   0.001998 241.709  < 2e-16 ***
## F4B.1       -0.011297   0.001998  -5.654 1.68e-08 ***
## F4B.2        0.455562   0.001998 228.000  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df     F p-value    
## s(t):F4A.1 8.638  8.962 12669  <2e-16 ***
## s(t):F4A.2 8.816  8.990 14264  <2e-16 ***
## s(t):F4B.1 8.551  8.942 12218  <2e-16 ***
## s(t):F4B.2 8.889  8.996  6972  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.992   Deviance explained = 99.2%
## fREML = -6733.5  Scale est. = 0.0020361  n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 8 iterations.
## Gradient range [-2.764712e-06,2.764141e-06]
## (score -6733.482 & scale 0.002036088).
## Hessian positive definite, eigenvalue range [3.777704,2036.029].
## Model rank =  40 / 40 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##              k'  edf k-index p-value
## s(t):F4A.1 9.00 8.64    1.07       1
## s(t):F4A.2 9.00 8.82    1.07       1
## s(t):F4B.1 9.00 8.55    1.07       1
## s(t):F4B.2 9.00 8.89    1.07       1

Notice that:

  • The model explains most of the variance
  • Residual plots look good
  • The AR1 pattern is clear, and it looks like an exponential decay

Let us add the AR1 term to the model, taking the estimated value \(\hat{\rho} =\) 0.8851126:

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F4 + s(t, by = F4, k = 10)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.503812   0.004994 100.876   <2e-16 ***
## F4A.2        0.483003   0.007063  68.382   <2e-16 ***
## F4B.1       -0.011101   0.007063  -1.572    0.116    
## F4B.2        0.455781   0.007063  64.527   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df    F p-value    
## s(t):F4A.1 8.654  8.974 1630  <2e-16 ***
## s(t):F4A.2 8.819  8.993 1889  <2e-16 ***
## s(t):F4B.1 8.561  8.958 1586  <2e-16 ***
## s(t):F4B.2 8.895  8.998 1494  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.992   Deviance explained = 99.2%
## fREML = -10010  Scale est. = 0.0018469  n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 6 iterations.
## Gradient range [-5.816972e-05,5.753168e-05]
## (score -10009.53 & scale 0.001846912).
## Hessian positive definite, eigenvalue range [3.784681,2036.029].
## Model rank =  40 / 40 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##              k'  edf k-index p-value
## s(t):F4A.1 9.00 8.65    1.07       1
## s(t):F4A.2 9.00 8.82    1.07       1
## s(t):F4B.1 9.00 8.56    1.07       1
## s(t):F4B.2 9.00 8.89    1.07       1

Notice that:

  • Explained variance is almost the same as without the AR1 term
  • Residual plots still look good
  • The AR1 pattern is gone

Let us now change two things:

  1. Pretend we do not know there are 4 levels, but only two levels A and B
  2. Remove the AR1 noise and introduce uncorrelated noise

Let us model this dataset as: y ~ F2 + s(t, by = F2)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F2 + s(t, by = F2, k = 10)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.745345   0.006508 114.524   <2e-16 ***
## F2B         -0.018165   0.009204  -1.974   0.0485 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(t):F2A 5.541  6.686 752.7  <2e-16 ***
## s(t):F2B 6.539  7.675 435.9  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.673   Deviance explained = 67.4%
## fREML = 822.95  Scale est. = 0.086408  n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 8 iterations.
## Gradient range [-5.533282e-07,4.192454e-07]
## (score 822.9545 & scale 0.08640774).
## Hessian positive definite, eigenvalue range [2.154157,2038.006].
## Model rank =  20 / 20 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value    
## s(t):F2A 9.00 5.54     0.1  <2e-16 ***
## s(t):F2B 9.00 6.54     0.1  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that:

  • Explained variance dropped substantially
  • Residual plots are awful: obvious patterns, even individual trajectories are visible
  • There is a very strong AR pattern, which does not look like an exponential decay

Let us blindly apply the AR1 recipe, i.e. take the estimated \(\hat{\rho} =\) 0.9911937 and re-fit the model:

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F2 + s(t, by = F2, k = 10)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.74514    0.03829  19.458   <2e-16 ***
## F2B         -0.01803    0.05416  -0.333    0.739    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(t):F2A 8.194  8.870 193.5  <2e-16 ***
## s(t):F2B 8.610  8.969 254.0  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.672   Deviance explained = 67.4%
## fREML = -7751.7  Scale est. = 0.067817  n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 8 iterations.
## Gradient range [-2.848765e-09,2.66256e-09]
## (score -7751.663 & scale 0.06781707).
## Hessian positive definite, eigenvalue range [3.560744,2038.013].
## Model rank =  20 / 20 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value    
## s(t):F2A 9.00 8.19     0.1  <2e-16 ***
## s(t):F2B 9.00 8.61     0.1  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that:

  • Explained variance did not improve
  • Residual plots are still awful; the trajectories are still there
  • The AR pattern has changed, but it has not gone away

Let us instead add a curve-specific random smooth term (and remove the AR1 term):
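Again the call itself is not shown, but the Formula line in the output below implies a fit along these lines (dat2 and its column names are assumed, not taken from the original code):

```r
library(mgcv)

# curve-specific random smooths (bs = "fs"), fitted separately within
# each level of F2, on top of the level-specific fixed smooths
m.rand <- bam(y ~ F2 + s(t, by = F2, k = 10) +
                s(t, curveId, by = F2, bs = "fs", m = 1),
              data = dat2)
summary(m.rand)
```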

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ F2 + s(t, by = F2, k = 10) + s(t, curveId, by = F2, bs = "fs", 
##     m = 1)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.74534    0.03882  19.202   <2e-16 ***
## F2B         -0.01816    0.05356  -0.339    0.735    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                      edf  Ref.df       F p-value    
## s(t):F2A           8.023   8.139   73.92  <2e-16 ***
## s(t):F2B           8.592   8.645  112.93  <2e-16 ***
## s(t,curveId):F2A 348.851 359.000 1201.31  <2e-16 ***
## s(t,curveId):F2B 347.848 359.000 1184.22  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.998   Deviance explained = 99.9%
## fREML = -8195.9  Scale est. = 0.00040811  n = 4080

## 
## Method: fREML   Optimizer: perf newton
## full convergence after 16 iterations.
## Gradient range [-1.997467e-05,1.62132e-05]
## (score -8195.885 & scale 0.0004081082).
## Hessian positive definite, eigenvalue range [3.384527,2063.451].
## Model rank =  1460 / 1460 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                      k'    edf k-index p-value
## s(t):F2A           9.00   8.02    0.99    0.32
## s(t):F2B           9.00   8.59    0.99    0.33
## s(t,curveId):F2A 720.00 348.85    0.99    0.34
## s(t,curveId):F2B 720.00 347.85    0.99    0.30

Notice that:

  • Explained variance reaches its ceiling
  • Residual plots are perfect
  • Residual autocorrelation is minimal
  • The ‘random’ smooths are clearly not random: they capture the latent factor
  • Compute time is quite high (minutes, on a toy example with 80 curves)

Take home message

  • Do not apply the AR1 recipe blindly
  • Autocorrelation can be a consequence of a latent factor or variable
  • Yet the computational cost of the ‘proper’ solution (curve-specific random smooths) is often prohibitively high